1.2 Decision Theory. Usually in statistics, instead of just two possible probability
distributions $P$, $Q$, as in the last section, there is an infinite family $\mathcal{P}$ of such
distributions, defined on a sample space, which is a measurable space $(X, \mathcal{B})$, in other
words a set $X$ together with a $\sigma$-algebra $\mathcal{B}$ of subsets of $X$. As noted previously, if
$X$ is a subset of a Euclidean space, then $\mathcal{B}$ will usually be the $\sigma$-algebra of Borel
subsets of $X$. If $X$ is a countable set, then $\mathcal{B}$ will usually be the $\sigma$-algebra of all
subsets of $X$ (if also $X \subset \mathbb{R}^k$, then all its subsets are in fact Borel sets). A
probability measure on $\mathcal{B}$ will be called a law. The family $\mathcal{P}$ of laws on
$(X, \mathcal{B})$ is usually written as $\{P_\theta,\ \theta \in \Theta\}$, where $\Theta$ is called a parameter
space. For example, if $\mathcal{P}$ is the set of all normal measures $N(\mu, \sigma^2)$ for
$\mu \in \mathbb{R}$ and $\sigma > 0$, we can take $\theta = (\mu, \sigma)$ or $(\mu, \sigma^2)$, where in either case
$\Theta$ is the open upper half-plane, that is, the set of all $(t, u) \in \mathbb{R}^2$ such that $u > 0$.
We assume that the function $\theta \mapsto P_\theta$ from $\Theta$ to laws on $\mathcal{B}$ is one-to-one, in other
words $P_\theta \ne P_\phi$ whenever $\theta \ne \phi$ in $\Theta$. So the sets $\mathcal{P}$ and $\Theta$ are in 1-1
correspondence and any structure on one can be taken over to the other. We also assume
given a $\sigma$-algebra $\mathcal{T}$ of subsets of $\Theta$. Most often $\Theta$ will be a subset of some
Euclidean space and $\mathcal{T}$ the family of Borel subsets of $\Theta$. The family
$\{P_\theta,\ \theta \in \Theta\}$ will be called measurable on $(\Theta, \mathcal{T})$ if and only if for each
$B \in \mathcal{B}$, the function $\theta \mapsto P_\theta(B)$ is measurable on $\Theta$. If $\Theta$ is finite or
countable, then (as with sample spaces) $\mathcal{T}$ will usually be taken to be the collection of
all its subsets. In that case the family $\{P_\theta,\ \theta \in \Theta\}$ is always measurable.

An observation will be a point $x$ of $X$. Given $x$, the statistician tries to make inferences
about $\theta$, such as estimating $\theta$ by a function $\hat\theta(x)$. For example, if $X = \mathbb{R}^n$ and
$P_\theta = N(\theta, 1)^n$, so $x = (X_1, \ldots, X_n)$ where the $X_j$ are i.i.d. with distribution
$N(\theta, 1)$, then $\hat\theta(x) = \overline{X} := (X_1 + \cdots + X_n)/n$ is the classical estimator of $\theta$.
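
As a minimal sketch of this example, the following Python code simulates one observation $x = (X_1, \ldots, X_n)$ from $N(\theta, 1)^n$ and evaluates the classical estimator. The names `theta_true`, `simulate_observation` and `sample_mean`, the sample size and the seed are illustrative assumptions, not part of the text.

```python
import random


def sample_mean(xs):
    """Classical estimator of theta: the average of the observations."""
    return sum(xs) / len(xs)


def simulate_observation(theta, n, rng):
    """Draw x = (X_1, ..., X_n), i.i.d. with distribution N(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


rng = random.Random(0)      # fixed seed so that the run is reproducible
theta_true = 2.0            # hypothetical true value of the parameter
x = simulate_observation(theta_true, n=10_000, rng=rng)
theta_hat = sample_mean(x)  # the estimate; its standard deviation is 1/sqrt(n) = 0.01
print(theta_hat)
```

The estimate varies with the seed, but with $n = 10{,}000$ it should lie within a few hundredths of `theta_true`.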

In decision theory, there is also a measurable space $(D, \mathcal{S})$, called the decision space.
A measurable function $d(\cdot)$ from $X$ into $D$ is called a decision rule. Such a rule says that
if $x$ is observed, then action $d(x)$ should be taken.

One possible decision space $D$ would be the set of all $d_\theta$ for $\theta \in \Theta$, where $d_\theta$ is the
decision (estimate) that $\theta$ is the true value of the parameter. Or, if we just have a set
$\mathcal{P}$ of laws, then $d_P$ would be the decision that $P$ is the true law. Thus in the last
section we had $\mathcal{P} = \{P, Q\}$ and for non-randomized tests, $D = \{d_P, d_Q\}$. There, a
decision rule is equivalent to a measurable subset of $X$, which was taken to be the set where
the decision will be $d_Q$. For randomized rules, still for $\mathcal{P} = \{P, Q\}$, the decision space
$D$ can be taken as the interval $0 \le d \le 1$, where $d(x)$ is the probability that $Q$ will be
chosen if $x$ is observed.

Another possible decision space is a collection of subsets of the parameter space.
Suppose $P_\theta = N(\theta, \sigma^2)^n$ on $X = \mathbb{R}^n$ for $-\infty < \theta < \infty$ where $\sigma$ is fixed and known.
Then $[\overline{X} - 1.96\sigma/n^{1/2},\ \overline{X} + 1.96\sigma/n^{1/2}]$ is a "95% confidence interval for $\theta$,"
meaning that for all $\theta$, $P_\theta\{|\overline{X} - \theta| > 1.96\sigma/n^{1/2}\} = 0.05$ (to the accuracy of the
rounded constant 1.96). Here the decision space $D$ could be taken as the set of all closed
intervals $[a, b] \subset \mathbb{R}$. Giving a confidence interval (or a "confidence set" more
generally) is one kind of decision rule.
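
The coverage statement can be checked numerically. The sketch below is an illustration under the stated normal model; the function names are hypothetical. It uses the fact that $(\overline{X} - \theta) n^{1/2}/\sigma$ has the standard normal law, so the probability of missing $\theta$ depends only on the constant 1.96.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def confidence_interval(xbar, sigma, n, z=1.96):
    """The decision rule: map the observed mean to a closed interval [a, b]."""
    half_width = z * sigma / math.sqrt(n)
    return (xbar - half_width, xbar + half_width)


def miss_probability(z=1.96):
    """P_theta{|Xbar - theta| > z*sigma/sqrt(n)}; free of theta, sigma and n."""
    return 2.0 * (1.0 - normal_cdf(z))


print(confidence_interval(xbar=0.0, sigma=1.0, n=4))
print(miss_probability())   # close to, but not exactly, 0.05
```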

In decision theory, we also assume given a loss function $L$, which is a measurable
function $L : \Theta \times D \to [0, \infty]$. Here $L(\theta, d)$ is the loss suffered when the decision $d$
is taken and $\theta$ is the true value of the parameter, sometimes called the "state of nature."
The following condition on a loss function will be noted for later reference, though not
always assumed:

(1.2.1)    $L(P, d_P) = 0$ for every $P \in \mathcal{P}$, in other words $L(P_\theta, d_\theta) = 0$ for all $\theta \in \Theta$,

that is, a correct decision incurs no loss. If the decision rule is an estimator
$T(x) = \hat\theta(x)$ for a real parameter $\theta$, then one frequently used loss function is
squared-error loss, $L(\theta, t) = (\theta - t)^2$.

Many authors treat a utility function $U(\theta, d)$ rather than a loss function. Here larger
values of $U$ are more favorable. If $L$ is a loss function, then $U = -L$ gives a utility
function, but not necessarily conversely: in accord with broad usage of the term among
statisticians and economists, a utility function may take values positive, negative or zero,
and the values $U(P, d_P)$ may not be 0 and may be different for different $P$. To give a
mathematical definition, a utility function is a measurable, extended real valued function
(i.e. its values are in $[-\infty, \infty]$) on $\mathcal{P} \times D$, or equivalently on $\Theta \times D$. If for $P$ and
$Q$ in $\mathcal{P}$, $d_P \in D$ and $d_Q \in D$, then it is assumed that

(1.2.2)    $U(P, d_Q) \le U(P, d_P)$,

that is, it's better to make the right decision than a wrong one. Values $U(P, d_Q) = -\infty$
are allowed, corresponding to the notion that a wrong decision could lead to "ruin" or
"death" of the decision maker and possibly others.

If $D = \{d_P : P \in \mathcal{P}\}$ and $U(P, d_P) < \infty$ for all $P$, we can get a corresponding loss
function satisfying (1.2.1) by setting

    $L(P, d_Q) = U(P, d_P) - U(P, d_Q)$.
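
For a finite family of laws the construction is a table operation. The sketch below is only an illustration: the two laws `"P"` and `"Q"` and the utility values are made up, chosen so that (1.2.2) holds.

```python
def loss_from_utility(utility):
    """Given utility[P][Q] = U(P, d_Q), return loss[P][Q] = U(P, d_P) - U(P, d_Q)."""
    laws = list(utility)
    return {p: {q: utility[p][p] - utility[p][q] for q in laws} for p in laws}


# Hypothetical utilities: each row is a true law, each column a decision.
utility = {
    "P": {"P": 5.0, "Q": -2.0},
    "Q": {"P": 0.0, "Q": 1.0},
}
loss = loss_from_utility(utility)
print(loss)
```

In this example the correct decisions get loss 0, as (1.2.1) requires, even though $U(P, d_P)$ and $U(Q, d_Q)$ differ.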

As this suggests, a loss function measures how far off a decision was, relative to other 
possible decisions. Such an evaluation seems natural for statistics. A utility function, 
by contrast, measures outcomes on a more absolute scale, incorporating the possibility 
that some values of $\theta$ are more favorable than others, as is natural in economics. Loss
functions and, especially, utility functions, reflect the preferences of the individual making 
the decision. 

The risk of a decision rule $e(\cdot)$ at $\theta$ is defined by

    $r(\theta, e) := \int L(\theta, e(x))\, dP_\theta(x)$,

that is, risk is expected loss. If $f$ and $g$ are two decision rules, let $f \preceq g$ mean that
$r(\theta, f) \le r(\theta, g)$ for all $\theta$. Also, $f$ improves on $g$, written $f \prec g$, will mean that
$f \preceq g$ and for some $\theta$, $r(\theta, f) < r(\theta, g)$. A decision rule $g$ is called inadmissible if
there is some rule $f$ with $f \prec g$. If there is no such $f$, then $g$ is called admissible. A
class $H$ of decision rules will be called complete if for every decision rule $g$ not in $H$
there is an $h \in H$ with $h \prec g$. If only $h \preceq g$, then $H$ is called essentially complete.
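
The ordering of rules by their risk functions can be explored numerically. The following sketch assumes the model $N(\theta, 1)^n$ with squared-error loss and compares three estimators whose risks are known in closed form: the sample mean (risk $1/n$), the first observation alone (risk 1), and the constant estimator 0 (risk $\theta^2$). All function names and the grid are hypothetical, and a comparison on a finite grid of $\theta$ values is only an indication, not a proof, of a relation that must hold for all $\theta$.

```python
def risk_sample_mean(theta, n):
    """Risk of the sample mean: its variance 1/n, for every theta."""
    return 1.0 / n


def risk_first_observation(theta, n):
    """Risk of the estimator X_1, which ignores the other observations."""
    return 1.0


def risk_constant_zero(theta, n):
    """Risk of the constant estimator 0: the squared bias theta**2."""
    return theta ** 2


def precedes(risk_f, risk_g, thetas, n):
    """Grid version of 'r(theta, f) <= r(theta, g) for all theta'."""
    return all(risk_f(t, n) <= risk_g(t, n) for t in thetas)


def improves_on(risk_f, risk_g, thetas, n):
    """Grid version of 'f improves on g': no worse anywhere, better somewhere."""
    return precedes(risk_f, risk_g, thetas, n) and any(
        risk_f(t, n) < risk_g(t, n) for t in thetas
    )


GRID = [k / 10.0 for k in range(-50, 51)]   # theta from -5 to 5
N = 4

print(improves_on(risk_sample_mean, risk_first_observation, GRID, N))
print(improves_on(risk_sample_mean, risk_constant_zero, GRID, N))
print(improves_on(risk_constant_zero, risk_sample_mean, GRID, N))
```

On this grid the sample mean improves on the first observation, while the sample mean and the constant 0 are not comparable: each has smaller risk for some $\theta$.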

As in the case of deciding between two laws in Sec. 1.1, there are randomized decision
rules. For such rules, the decision space doesn't consist only of definite, specific decisions
such as $d_P$. Instead, $D$ contains probability measures on a space $A$ of specific actions. If
$x$ is observed, and $d(x) \in D$ is a law $\nu_x$ on $A$, one will choose an action $a \in A$ at random
according to $\nu_x$, then take the action $a$. If $A = \{d_P, d_Q\}$, a law $\nu$ on $A$ is given by a
number $y$ with $0 \le y \le 1$, where $y = \nu(d_Q) = 1 - \nu(d_P)$.

In Sec. 1.1, we saw that randomized rules gave us admissible tests of all possible sizes, 
but also that for Bayes rules it was not necessary to randomize (Remark 1.1.9). 

A decision rule $e(\cdot)$, which may be randomized, is called minimax if it minimizes the
maximum risk, or mathematically

    $\sup_\theta r(\theta, e) = \inf_{d(\cdot)} \sup_\theta r(\theta, d)$,

where the infimum is over all possible decision rules, possibly randomized.

Minimax and randomized rules are both important in another subject closely related
to decision theory, game theory. Here there are actions $a \in A$, parameters $\theta \in \Theta$, and a
utility function $U(\theta, a)$. In game theory, at any rate in the basic form of it being treated
here, there is no sample space $X$, and so no observation $x$ nor laws $P_\theta$. Also, an intelligent
opponent can choose $\theta$, possibly even knowing one's (randomized) decision rule, although
not one's specific action $a$. Thus, if there are minimax decision rules, then such rules
should be used.

For example, in the game of "scissors-stone-paper," each of two players has available
three possible actions, "scissors," "stone" and "paper," and the two take actions
simultaneously. Here scissors beats paper, paper beats stone, and stone beats scissors.
Suppose the winner of one round gains $1 and the loser loses $1. If both players take the
same action, the outcome is a draw (no money changes hands). Then for repeated plays, one
should use a randomized rule: if one always takes the same action, or any predictable
sequence of actions, the opponent can always win. The randomized rule of choosing each
action with probability 1/3 has average winnings 0 against any strategy of the opponent,
and this strategy is minimax and is the unique minimax strategy: any other randomized
rule, if known to the opponent, can be defeated and result in an average net loss (this is
left as a problem).
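
The average winnings of a randomized rule against each action of the opponent can be computed exactly. The sketch below is an illustration with hypothetical names; the `biased` strategy is one arbitrary example of a non-uniform rule, and checking it does not replace the general proof left as a problem.

```python
from fractions import Fraction

ACTIONS = ("scissors", "stone", "paper")
BEATS = {"scissors": "paper", "paper": "stone", "stone": "scissors"}


def payoff(mine, theirs):
    """Winnings in dollars for one round: +1 for a win, -1 for a loss, 0 for a draw."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1


def expected_winnings(strategy, opponent_action):
    """Average winnings of a randomized strategy against a fixed opposing action."""
    return sum(p * payoff(a, opponent_action) for a, p in strategy.items())


def worst_case(strategy):
    """Average winnings when the opponent knows the strategy and replies optimally."""
    return min(expected_winnings(strategy, b) for b in ACTIONS)


uniform = {a: Fraction(1, 3) for a in ACTIONS}
biased = {"scissors": Fraction(1, 2), "stone": Fraction(1, 4), "paper": Fraction(1, 4)}

print(worst_case(uniform))   # 0
print(worst_case(biased))    # -1/4, attained when the opponent plays stone
```

Since the average against a randomized opponent is a weighted average of the values against the opponent's pure actions, it is enough to take the minimum over the three pure replies.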

Even when minimax rules are not available, it may be that for each specific action
$a \in A$, there is a large loss $L(\theta, a)$ for some $\theta$. Then, possibly, any non-randomized
decision rule choosing $a \in A$ has a large risk for some $\theta$. As in the insurance business,
risks above a certain size may be unacceptable, so that one may prefer to choose a randomized
rule $d(\cdot)$ to keep $\sup_\theta r(\theta, d)$ from being too large. In insurance, unlike simpler
problems, "minimax" seems too extreme a requirement: policies would only be sold if the
buyers could be persuaded to pay more in premiums than they could ever receive in benefits!
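
To illustrate how randomizing can cap the worst-case loss, here is a minimal sketch with no observation, two parameter values and two actions. The loss table and all names are hypothetical. The loss of a randomized choice is taken to be the average of the losses of the specific actions, weighted by the probabilities with which they are chosen.

```python
def randomized_loss(loss, theta, nu):
    """Expected loss at theta when the action is drawn from the law nu on actions."""
    return sum(prob * loss[theta][a] for a, prob in nu.items())


def max_loss(loss, nu):
    """Largest expected loss over all parameter values in the table."""
    return max(randomized_loss(loss, theta, nu) for theta in loss)


# Hypothetical loss table: each specific action is very costly for one theta.
loss = {
    "theta1": {"a1": 0.0, "a2": 10.0},
    "theta2": {"a1": 10.0, "a2": 0.0},
}
always_a1 = {"a1": 1.0, "a2": 0.0}   # a non-randomized choice
mixed = {"a1": 0.5, "a2": 0.5}       # choose either action with probability 1/2

print(max_loss(loss, always_a1))     # 10.0
print(max_loss(loss, mixed))         # 5.0
```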

For a measurable space $(A, \mathcal{E})$ of specific actions, let $D_{\mathcal{E}}$ be the set of all
probability measures on $(A, \mathcal{E})$. On $D_{\mathcal{E}}$ let $\mathcal{S}_{\mathcal{E}}$ be the smallest $\sigma$-algebra for
which all the evaluations $\nu \mapsto \nu(B)$ are measurable for $B \in \mathcal{E}$. For a loss function $L$
on $\Theta \times A$, and $\nu \in D_{\mathcal{E}}$, the loss at $\theta$ and $\nu$ will be

(1.2.3)    $L(\theta, \nu) := \int L(\theta, a)\, d\nu(a)$.

The next fact extends the joint measurability of $L$ from simple to randomized strategies.
Recall that for two measurable spaces $(U, \mathcal{U})$ and $(V, \mathcal{V})$, a function $F$ on $U \times V$ is
called jointly measurable if and only if it is measurable for the product $\sigma$-algebra
$\mathcal{U} \otimes \mathcal{V}$, which is the smallest $\sigma$-algebra of subsets of $U \times V$ containing all sets
$B \times C$ for $B \in \mathcal{U}$ and $C \in \mathcal{V}$.

1.2.4 Proposition. If $L$ is nonnegative and jointly measurable on $\Theta \times A$, then as defined
on $\Theta \times D_{\mathcal{E}}$ by (1.2.3), it is jointly measurable into $[0, \infty]$, for the $\sigma$-algebra
$\mathcal{S}_{\mathcal{E}}$ on $D_{\mathcal{E}}$.

If $f$ is a jointly measurable real-valued function on $\Theta \times A$ and $W_f$ is the set of all
$(\theta, \nu)$ such that $f(\theta, \nu) := \int f(\theta, a)\, d\nu(a)$ is well-defined (not $\infty - \infty$), then $W_f$
is a measurable set for the product $\sigma$-algebra and $(\theta, \nu) \mapsto f(\theta, \nu)$ is jointly
measurable on it for $\mathcal{T}$ and $\mathcal{S}_{\mathcal{E}}$ into $[-\infty, \infty]$.

Proof. Measurability holds if $L = 1_{B \times C}$ where $1_{B \times C}(\theta, a) = 1_B(\theta) 1_C(a)$ for
$B \in \mathcal{T}$ and $C \in \mathcal{E}$, since then $L(\theta, \nu) = 1_B(\theta)\nu(C)$, which is measurable for
$\mathcal{T} \otimes \mathcal{S}_{\mathcal{E}}$. Next, any finite union $W$ of sets $B_i \times C_i$ for $B_i \in \mathcal{T}$ and
$C_i \in \mathcal{E}$ can be written as a finite, disjoint union of such sets (RAP, Prop. 3.2.2).
Adding their indicators, we get the result for $1_W$.

The set of all such $W$ forms an algebra (RAP, Prop. 3.2.3). If $L_n$ is a uniformly
bounded sequence of measurable functions on $\Theta \times A$ for which the conclusion holds, and
$L_n \uparrow L$ or $L_n \downarrow L$, then it also holds for $L$. Since the smallest monotone class including
an algebra is a $\sigma$-algebra (RAP, Theorem 4.4.2), the result holds for $1_H$ for all $H$ in the
product $\sigma$-algebra $\mathcal{T} \otimes \mathcal{E}$. Then it holds for $L$ simple (a finite linear combination
of such $1_H$), then for $L$ nonnegative and measurable by monotone convergence (RAP,
Prop. 4.1.5 and Theorem 4.3.2).

Then for any measurable $f$ on $\Theta \times A$, we write as usual $f = f^+ - f^-$ where
$f^+ := \max(f, 0)$ and $f^- := \max(-f, 0)$. Then $W_f$ is the set of $(\theta, \nu)$ such that
$\int f^+(\theta, a)\, d\nu(a)$ and $\int f^-(\theta, a)\, d\nu(a)$ are not both $+\infty$, so it is a product
measurable set. Applying the result for $L \ge 0$ to $f^+$ and $f^-$, we get that on $W_f$,
$f(\theta, \nu)$ is a difference of two nonnegative measurable functions, not both $+\infty$, so the
difference is a measurable function into $[-\infty, \infty]$. □

Remark. Although loss functions are assumed nonnegative, utility functions can be
positive or negative, so Proposition 1.2.4 can be applied when $f$ is a utility function.

A function $\nu_\cdot : x \mapsto \nu_x$ from $X$ into $D_{\mathcal{E}}$ will be called a randomized decision rule
if it is measurable from $(X, \mathcal{B})$ to $(D_{\mathcal{E}}, \mathcal{S}_{\mathcal{E}})$. The risk of the rule for a given
$\theta$ is

    $r(\theta, \nu_\cdot) := \int L(\theta, \nu_x)\, dP_\theta(x)$,

which is a measurable function of $\theta$. The definitions of admissible and inadmissible rules
extend directly to $D_{\mathcal{E}}$ in place of $A$. Conversely, a randomized decision rule may be
viewed as a non-randomized one with the space $D_{\mathcal{E}}$ in place of the action space $A$, and the
risk $r(\theta, \nu)$ in place of the loss $L(\theta, a)$. So, let's just consider ordinary decision rules
$a(\cdot)$.

Given a prior distribution $\pi$ on $(\Theta, \mathcal{T})$, the risk of a decision rule $a(\cdot)$ is defined as
$r(a(\cdot)) := \iint L(\theta, a(x))\, dP_\theta(x)\, d\pi(\theta)$. A decision rule $a(\cdot)$ will be called a
Bayes rule if it attains the minimum risk (for the given $\pi$) over all possible decision
rules, and if this minimum risk is finite.
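
When the parameter space, the sample space and the action space are all finite, the double integral is a finite sum and a Bayes rule can be found by minimizing, separately for each $x$, the sum over $\theta$ of $\pi(\theta) P_\theta(x) L(\theta, a)$. The sketch below works this out for a made-up two-point problem with 0-1 loss; the laws, the prior and all names are assumptions chosen for illustration.

```python
from fractions import Fraction as F

THETAS = (0, 1)
XS = (0, 1)
ACTIONS = (0, 1)
# Hypothetical laws: LAWS[theta][x] is the probability of observing x under P_theta.
LAWS = {0: {0: F(4, 5), 1: F(1, 5)}, 1: {0: F(3, 10), 1: F(7, 10)}}
PRIOR = {0: F(1, 2), 1: F(1, 2)}


def loss(theta, a):
    """0-1 loss: a correct decision costs nothing, a wrong one costs 1."""
    return 0 if a == theta else 1


def bayes_risk(rule):
    """Risk of a rule for the prior: sum over theta and x of prior * law * loss."""
    return sum(PRIOR[t] * LAWS[t][x] * loss(t, rule[x]) for t in THETAS for x in XS)


def bayes_rule():
    """For each x, pick an action minimizing the prior-weighted expected loss."""
    rule = {}
    for x in XS:
        rule[x] = min(
            ACTIONS,
            key=lambda a: sum(PRIOR[t] * LAWS[t][x] * loss(t, a) for t in THETAS),
        )
    return rule


rule = bayes_rule()
print(rule)               # {0: 0, 1: 1}
print(bayes_risk(rule))   # 1/4
```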

Even for non-Bayesian statisticians, who don't believe priors should be used in
practice, at least not in all cases, priors can be useful technically in showing that a
decision rule is admissible:
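1.2.5 Theorem. Suppose $a(\cdot)$ is a decision rule and for some prior $\pi$ on $\Theta$, $a(\cdot)$ is Bayes
for $\pi$ and is unique Bayes, meaning that for any other Bayes decision rule $b(\cdot)$ for $\pi$, and
all $\theta$, $a(x) = b(x)$ for $P_\theta$-almost all $x$. Then $a(\cdot)$ is admissible.

Proof. If there were a decision rule $c(\cdot) \prec a(\cdot)$, then clearly $c(\cdot)$ is also Bayes
(integrating $r(\theta, c) \le r(\theta, a)$ with respect to $\pi$ shows that its risk for $\pi$ is no larger
than that of $a(\cdot)$), but for some $\theta$, $c(x) \ne a(x)$ with positive probability for $P_\theta$ (since
$r(\theta, c) < r(\theta, a)$ for that $\theta$), a contradiction. □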

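1.2.6 Theorem. Suppose the parameter space $\Theta$ is countable. If $a(\cdot)$ is a decision rule
and there is a prior $\pi$ on $\Theta$, positive at every point, such that $a(\cdot)$ is Bayes for $\pi$, then
$a(\cdot)$ is admissible.

Proof. If $b(\cdot)$ were another decision rule with $b \prec a$, then, since $\pi$ gives positive mass
to a point $\theta$ where $r(\theta, b) < r(\theta, a)$, the risk of $b(\cdot)$ for $\pi$ would be smaller than that
of $a(\cdot)$, a contradiction. □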
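
In a small finite problem the conclusion of Theorem 1.2.6 can be checked by brute force. The sketch below uses a made-up problem with two parameter values, two possible observations and 0-1 loss; all names and numbers are hypothetical. It enumerates the non-randomized rules only, so it illustrates the theorem rather than verifying admissibility among all (possibly randomized) rules.

```python
from fractions import Fraction as F
from itertools import product

THETAS = (0, 1)
XS = (0, 1)
ACTIONS = (0, 1)
LAWS = {0: {0: F(4, 5), 1: F(1, 5)}, 1: {0: F(3, 10), 1: F(7, 10)}}
PRIOR = {0: F(1, 2), 1: F(1, 2)}   # positive at every point of THETAS


def risk(theta, rule):
    """r(theta, rule) under 0-1 loss: the probability of a wrong decision."""
    return sum(LAWS[theta][x] for x in XS if rule[x] != theta)


def improves_on(f, g):
    """f is no worse than g for every theta and strictly better for some theta."""
    return all(risk(t, f) <= risk(t, g) for t in THETAS) and any(
        risk(t, f) < risk(t, g) for t in THETAS
    )


def prior_risk(rule):
    """Risk of the rule for the prior: the prior-weighted average of r(theta, rule)."""
    return sum(PRIOR[t] * risk(t, rule) for t in THETAS)


# All four non-randomized rules, each a map from observations to actions.
RULES = [dict(zip(XS, acts)) for acts in product(ACTIONS, repeat=len(XS))]

bayes = min(RULES, key=prior_risk)
bayes_is_admissible = not any(improves_on(f, bayes) for f in RULES)
inadmissible = [g for g in RULES if any(improves_on(f, g) for f in RULES)]

print(bayes, bayes_is_admissible)   # {0: 0, 1: 1} True
print(inadmissible)                 # [{0: 1, 1: 0}]
```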

A set $C$ in a Euclidean space $\mathbb{R}^k$ is called convex if whenever $x, y \in C$ and $0 \le t \le 1$
we have $tx + (1 - t)y \in C$.

1.2.7 Proposition. Let $(X, \mathcal{B})$ be the sample space and $(A, \mathcal{E})$ the action space where $A$
is a convex, Borel measurable subset of a Euclidean space $\mathbb{R}^k$ with Borel $\sigma$-algebra
$\mathcal{E}$. Let $\|a\| := (a_1^2 + \cdots + a_k^2)^{1/2}$. Let $x \mapsto \nu_x : X \to D_{\mathcal{E}}$ be a randomized decision
rule. Then $B_\nu := \{x \in X : \int \|a\|\, d\nu_x(a) < \infty\} \in \mathcal{B}$. Let $a(x) := \int a\, d\nu_x(a)$ for
$x \in B_\nu$, where integration of vectors is coordinatewise. Then $x \mapsto a(x)$ is measurable from
$B_\nu$ into $A$.

Proof. By Proposition 1.2.4 (for a fixed $\theta$), $\nu \mapsto \int \|a\|\, d\nu(a)$ is measurable from
$(D_{\mathcal{E}}, \mathcal{S}_{\mathcal{E}})$ into $[0, \infty]$. Since $x \mapsto \nu_x$ is measurable from $(X, \mathcal{B})$ to
$(D_{\mathcal{E}}, \mathcal{S}_{\mathcal{E}})$, it follows that $B_\nu \in \mathcal{B}$. For $x \in B_\nu$, the integral $a(x)$ is
well-defined and has no infinite coordinates. We have $a(x) \in A$ by the proof of Jensen's
inequality (RAP, 10.2.6). By Proposition 1.2.4 (again for a single $\theta$), $x \mapsto a(x)$ is
measurable. □
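
For a law with finite support the integral defining $a(x)$ is a finite weighted sum, and convexity of $A$ is what keeps the result inside $A$. The sketch below is a hypothetical illustration with $A$ the closed unit disk in the plane and one made-up law $\nu_x$; the names are not from the text.

```python
from fractions import Fraction as F


def mean_action(nu):
    """Coordinatewise mean of a finitely supported law nu = {action: probability}."""
    dim = len(next(iter(nu)))
    return tuple(sum(p * a[i] for a, p in nu.items()) for i in range(dim))


def in_unit_disk(a):
    """Membership in the convex action space A = closed unit disk."""
    return sum(c * c for c in a) <= 1


# Hypothetical law nu_x: three boundary points of the disk with their probabilities.
nu_x = {
    (F(1), F(0)): F(1, 2),
    (F(0), F(1)): F(1, 4),
    (F(-1), F(0)): F(1, 4),
}
a_x = mean_action(nu_x)
print(a_x, in_unit_disk(a_x))   # the mean is the point (1/4, 1/4), which lies in the disk
```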

PROBLEMS 

1. Show that whenever the set of all admissible decision rules is essentially complete, it is 
actually complete and is the smallest complete class. 

2. In the situation of Sec. 1.1 where decision rules are randomized tests of P vs. Q, 

(a) Show that there is a smallest complete class and describe it. (Hint: use the result 
of Problem 1.) 

(b) Show that for some P and Q there is an essentially complete class smaller than 
the class in (a). In terms of P and Q, describe an essentially complete class which is 
as small as possible. 

3. Show that the set of all admissible decision rules may not be essentially complete and in 
fact, may be empty. Hint: without any sample or parameter space, let the action space 
be the set of positive real numbers a with loss function L(a) = 1/a. Describe what are 
the complete and essentially complete classes in this case. 

4. Prove that if a player of scissors-stone-paper has a known randomized strategy anything 
other than playing each action with probability 1/3, the opponent can win on the average 
for some suitable strategy. 
5. Let $P_\theta = N(\theta, 1)^n$ on $\mathbb{R}^n$ for $-\infty < \theta < \infty$. Let the action space be $\mathbb{R}$ and the loss
function $L(\theta, a) = (a - \theta)^2$. Show that a decision rule (estimator) which is a linear
function of $\overline{X}$, $d(x) = b\overline{X} + c$, is inadmissible if $|b| > 1$ but admissible if $b = 0$. (It is
admissible for $b = 1$ and $c = 0$, but this problem doesn't ask for a proof of that.)

NOTES TO SEC. 1.2 

A classic reference on decision theory is Ferguson (1967). A more recent work is 
Berger (1980, 1985). Decision theory was also the subject of the last chapter in each of 
two general texts by other leading statisticians: Bickel and Doksum (1977) and Cox and
Hinkley (1974). In the second edition of Bickel and Doksum (2001), decision theory is 
more integrated into the book, beginning with the first chapter, as it is here. 